Chapter 6: Partial pooling and likelihood

Partial pooling is one of the primary motivations for using hierarchical models. Also termed “borrowing information”, partial pooling improves estimates of group-level parameters, particularly when we have \(>3\) groups and/or varying sample sizes among groups. In this chapter, we illustrate partial pooling and link it to prior distributions in a maximum likelihood context.

Learning goals

  • motivation for and definition of partial pooling
  • simple hierarchical models with likelihood
  • hyperparameters
  • varying intercepts (NBA freethrow example) with lme4
  • partial pooling
  • clearing up confusion about nestedness
  • predictors for multiple levels
  • plotting estimates for different levels from lme4 models

Partial pooling: free throw example

Often, data are structured hierarchically. For instance, maybe we sample individual animals within each of multiple sites. Or, perhaps we sequence multiple genes across multiple individuals. There may be more than two levels, for instance if we sample parasites of different species within individuals of different host species across multiple sites in a landscape. Commonly, sample sizes are not equal across units at the various levels. The following example demonstrates how to use partial pooling to generate reliable estimates in the context of wildly varying sample sizes.

Suppose we are interested in knowing who the best free throw shooter was in the 2014-2015 NBA season. We can pull the data from the web, and plot the proportion of free-throws made by player.

rawd <- read.csv("leagues_NBA_2015_totals_totals.csv")
str(rawd)
## 'data.frame':    651 obs. of  30 variables:
##  $ Rk    : int  1 2 3 4 5 5 5 6 7 8 ...
##  $ Player: Factor w/ 492 levels "Aaron Brooks",..: 384 249 438 212 36 36 36 8 152 85 ...
##  $ Pos   : Factor w/ 11 levels "C","PF","PF-SF",..: 2 9 1 2 9 9 9 1 2 1 ...
##  $ Age   : int  24 20 21 28 29 29 29 26 23 26 ...
##  $ Tm    : Factor w/ 31 levels "ATL","BOS","BRK",..: 20 15 21 18 29 8 25 19 23 20 ...
##  $ G     : int  68 30 70 17 78 53 25 68 41 61 ...
##  $ GS    : int  22 0 67 0 72 53 19 8 9 16 ...
##  $ MP    : int  1287 248 1771 215 2502 1750 752 957 540 976 ...
##  $ FG    : int  152 35 217 19 375 281 94 181 40 144 ...
##  $ FGA   : int  331 86 399 44 884 657 227 329 78 301 ...
##  $ FG.   : num  0.459 0.407 0.544 0.432 0.424 0.428 0.414 0.55 0.513 0.478 ...
##  $ X3P   : int  18 10 0 0 118 82 36 0 0 0 ...
##  $ X3PA  : int  60 25 2 0 333 243 90 0 5 0 ...
##  $ X3P.1 : num  0.3 0.4 0 NA 0.354 0.337 0.4 NA 0 NA ...
##  $ X2P   : int  134 25 217 19 257 199 58 181 40 144 ...
##  $ X2PA  : int  271 61 397 44 551 414 137 329 73 301 ...
##  $ X2P.1 : num  0.494 0.41 0.547 0.432 0.466 0.481 0.423 0.55 0.548 0.478 ...
##  $ eFG.  : num  0.486 0.465 0.544 0.432 0.491 0.49 0.493 0.55 0.513 0.478 ...
##  $ FT    : int  76 14 103 22 167 127 40 81 13 50 ...
##  $ FTA   : int  97 23 205 38 198 151 47 99 27 64 ...
##  $ FT.   : num  0.784 0.609 0.502 0.579 0.843 0.841 0.851 0.818 0.481 0.781 ...
##  $ ORB   : int  79 9 199 23 27 21 6 104 78 101 ...
##  $ DRB   : int  222 19 324 54 220 159 61 211 98 237 ...
##  $ TRB   : int  301 28 523 77 247 180 67 315 176 338 ...
##  $ AST   : int  68 16 66 15 129 101 28 47 28 75 ...
##  $ STL   : int  27 16 38 4 41 32 9 21 17 37 ...
##  $ BLK   : int  22 7 86 9 7 5 2 51 16 65 ...
##  $ TOV   : int  60 14 99 9 116 83 33 69 17 59 ...
##  $ PF    : int  147 24 222 30 167 108 59 151 96 122 ...
##  $ PTS   : int  398 94 537 60 1035 771 264 443 93 338 ...

Some players switched teams mid-season, and they appear on separate rows, one for each team they played on. We need to aggregate across the teams, so that we end up with only one row per player:

library(dplyr)
library(ggplot2)

# clean the data
d <- rawd %>%
  group_by(Player) %>%
  summarize(ft_made = sum(FT), 
            ft_miss = sum(FTA) - sum(FT),
            ft_shot = sum(FTA), 
            ft_pct = sum(FT) / sum(FTA)) %>%
  subset(ft_shot != 0) %>%
  arrange(-ft_pct) %>%
  droplevels()
d
## Source: local data frame [475 x 5]
## 
##                   Player ft_made ft_miss ft_shot ft_pct
##                   (fctr)   (int)   (int)   (int)  (dbl)
## 1              Alex Kirk       2       0       2      1
## 2  Chris Douglas-Roberts       8       0       8      1
## 3            C.J. Wilcox       2       0       2      1
## 4          Grant Jerrett       2       0       2      1
## 5              Ian Clark      18       0      18      1
## 6          Jannero Pargo       2       0       2      1
## 7           Jerel McNeal       2       0       2      1
## 8         John Lucas III       5       0       5      1
## 9          Kenyon Martin       2       0       2      1
## 10          Luigi Datome       2       0       2      1
## ..                   ...     ...     ...     ...    ...
# order factor levels by ft_pct and plot
d$shooter <- factor(d$Player, levels = d$Player[order(d$ft_pct)])
ggplot(d, aes(x=ft_pct, y=shooter)) + 
  geom_point(stat='identity') + 
  theme(text = element_text(size=8))

Wow! Looks like we have some really good (100% accuracy) and really bad (0% accuracy) free throw shooters in the NBA. We can verify this by calculating maximum likelihood estimates for the probability of making a free throw for each player. We’ll assume that the number of free throws made is a binomial random variable, with \(p_i\) to be estimated for the \(i^{th}\) player, and \(k_i\) equal to the number of free throw attempts by player \(i\).

\[y_i \sim Binom(p_i, k_i)\]

# fit binomial glm
m <- glm(cbind(ft_made, ft_miss) ~ 0 + Player, 
         family=binomial, data=d)

# print estimated probabilities
probs <- m %>%
  coef() %>%
  plogis() %>%
  round(digits=4) %>% 
  sort(decreasing=TRUE)

It turns out that the maximum likelihood estimates are equal to the proportion of free throws made.
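This makes sense analytically: for a single player who makes \(y\) of \(k\) attempts, setting the derivative of the binomial log-likelihood to zero recovers the empirical proportion.

\[\frac{d}{dp}\left[y \, log(p) + (k - y) \, log(1 - p)\right] = \frac{y}{p} - \frac{k - y}{1 - p} = 0 \quad \Rightarrow \quad \hat{p} = \frac{y}{k}\]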

plot(d$ft_pct, probs, 
     xlab="Empirical proportion FT made", 
     ylab="MLE: Pr(make FT)")

But, can we really trust these estimates? What if we plot the maximum likelihood estimates along with the number of free throw attempts?

ggplot(d, aes(x=ft_shot, y=ft_pct)) + 
  geom_point(alpha=.6) + 
  xlab('Free throw attempts') + 
  ylab('Proportion of free throws made')

Well that’s not good. It looks like the players with the highest and lowest shooting percentages took the fewest shots. We should be skeptical of these estimates. One solution is to select some minimum number of attempts, and only believe estimates for players who have taken at least that number. This is what the NBA does, and the limit is 125 shots.

d %>%
  filter(ft_shot >= 125) %>%
  ggplot(aes(x=ft_shot, y=ft_pct)) + 
  geom_point(alpha=.6) + 
  xlab('Free throw attempts') + 
  ylab('Proportion of free throws made')

This seems somewhat arbitrary - what’s special about the number 125? Is there a better way to decide which estimates to trust? What if instead of tossing out the data from players that haven’t taken at least 125 shots, we tried to somehow improve those estimates? For instance, we might instead pull these estimates towards the average, and place increasingly more trust in proportions from players with more information (shot attempts). This is the intuition behind partial pooling, and the secret to implementation lies in placing a prior distribution on \(p_i\), the probability that player \(i\) makes a free throw, so that:

\[y_i \sim Binom(p_i, k_i)\]

\[logit(p_i) \sim N(\mu_p, \sigma_p)\]

such that the likelihood to maximize is:

\[[y_i \mid p_i] [p_i \mid \mu_p, \sigma_p]\]

where \(\mu_p\) represents the overall league (among-player) mean, on a logit scale, of the probability of making a free throw, and \(\sigma_p\) represents the variability among players in the logit probability of making a free throw. This type of model is sometimes called a varying-intercept or random-intercept model. Because \(\mu_p\) and \(\sigma_p\) determine the distribution of the parameter \(p\), they are known as hyperparameters. This model approaches the previous model with no hyperparameters as \(\sigma_p\) approaches \(\infty\). A similar strategy would be to place a beta prior directly on \(p_i\), rather than placing a normal prior on \(logit(p_i)\).

This model is hierarchical in the sense that we have a within-player model (each player gets their own \(p_i\)) and an among-player model (with \(\sigma_p\) controlling the among-player variation). To begin with, we will implement the logit-normal model in a maximum likelihood context, finding a maximum likelihood estimate for the hyperparameter \(\sigma_p\); unfortunately, this approach does not account for uncertainty in the hyperparameters.
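Concretely, the quantity being maximized is a marginal likelihood in which each player’s \(logit(p_i)\) is integrated out over its normal prior. A minimal sketch of that integral for a single hypothetical player (3 of 5 free throws made; the hyperparameter values are illustrative):

```r
# marginal likelihood for one player: the binomial likelihood averaged
# over the normal prior on logit(p); y made out of k attempts
marg_lik <- function(y, k, mu, sigma) {
  integrand <- function(eta) dbinom(y, k, plogis(eta)) * dnorm(eta, mu, sigma)
  integrate(integrand, -Inf, Inf)$value
}

# illustrative values: 3 of 5 made, hyperparameters near the fitted ones
marg_lik(y = 3, k = 5, mu = 1.08, sigma = 0.54)
```

In practice, glmer approximates this integral with the Laplace approximation rather than numerical quadrature, but the target being maximized is the same.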

library(lme4)
m2 <- glmer(cbind(ft_made, ft_miss) ~ (1|Player), 
         family=binomial, data=d)
summary(m2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ft_made, ft_miss) ~ (1 | Player)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##   3394.0   3402.3  -1695.0   3390.0      473 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.41341 -0.28396  0.01969  0.28122  1.70985 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Player (Intercept) 0.2951   0.5432  
## Number of obs: 475, groups:  Player, 475
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.07548    0.02891    37.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We used the glmer function in the lme4 package to implement the above model. The “(Intercept)” parameter is our estimate of \(\mu_p\); back-transformed with plogis(), it is 0.7456375 on the probability scale. We also see the maximum likelihood estimate for \(\sigma_p\) under the random effects section: 0.5431875.

Let’s visualize the new estimates:

# get estimated probabilities for each player from m2
shrunken_probs <- plogis(fixef(m2) + unlist(ranef(m2)))

# match these to the player names, 
# from the row names of the ranef design matrix
shrunken_names <- m2@pp$Zt@Dimnames[[1]]

ranef_preds <- data.frame(Player = shrunken_names, 
                          p_shrink = shrunken_probs)

# join the raw data with the model output
joined <- full_join(d, ranef_preds)
## Joining by: "Player"
# difference between naive & shrunken MLEs
joined$diff <- joined$ft_pct - joined$p_shrink

# plot naive MLE vs. shrunken MLE
ggplot(joined, aes(x=ft_pct, y=p_shrink, color=ft_shot)) + 
  geom_point(shape=1) + 
  scale_color_gradientn(colours=rainbow(4)) +
  geom_abline(intercept=0, slope=1, linetype='dashed') +
  xlab('Naive MLE') + 
  ylab('Shrunken MLE')

# using facets instead of colors
joined$ft_group <- cut(joined$ft_shot, 9)
ggplot(joined, aes(x=ft_pct, y=p_shrink)) + 
  geom_abline(intercept=0, slope=1, linetype='dashed', alpha=.7) +
  geom_point(shape=1, alpha=.5, size=1) + 
  facet_wrap(~ ft_group) +
  xlab('Naive MLE') + 
  ylab('Shrunken MLE')

# view difference in estimates as a function of freethrows shot
ggplot(joined, aes(x=ft_shot, y=diff)) + 
  geom_point(shape=1) + 
  xlab("Free throw attempts") +
  ylab("Naive MLE - Shrunken MLE")

The estimates from the hierarchical model differ from the maximum likelihood estimates obtained in our first model. In particular, the imprecise estimates (e.g., those for players with few attempted shots) are shrunken towards the grand mean. This shrinkage is highly desirable, and is consistent with the idea that we place more trust in estimates informed by more data.

What about the NBA’s magic number of 125? Do we find that estimates are still undergoing shrinkage beyond this range, or are the empirical free throw percentages reliable if players have taken 125 or more shots?

joined %>%
  filter(ft_shot >= 125) %>%
  ggplot(aes(x=ft_shot, y=diff)) + 
  geom_point(shape=1) + 
  xlab("Free throw attempts") +
  ylab("Naive MLE - Shrunken MLE")

It looks like there are still slight differences between the naive estimates (empirical proportions) and the probabilities that we estimate with partial pooling. Based on the graph above, we might conclude that the 125-shot threshold gives estimates that are reliable to within plus or minus 2 or 3 percentage points.

So, which player do we think is the best and the worst?

joined[which.max(joined$p_shrink), ]
## Source: local data frame [1 x 9]
## 
##          Player ft_made ft_miss ft_shot    ft_pct       shooter  p_shrink
##          (fctr)   (int)   (int)   (int)     (dbl)        (fctr)     (dbl)
## 1 Stephen Curry     308      29     337 0.9139466 Stephen Curry 0.9023947
## Variables not shown: diff (dbl), ft_group (fctr)
joined[which.min(joined$p_shrink), ]
## Source: local data frame [1 x 9]
## 
##        Player ft_made ft_miss ft_shot    ft_pct     shooter  p_shrink
##        (fctr)   (int)   (int)   (int)     (dbl)      (fctr)     (dbl)
## 1 Joey Dorsey      24      59      83 0.2891566 Joey Dorsey 0.3570848
## Variables not shown: diff (dbl), ft_group (fctr)

Congrats Steph Curry (for this accomplishment and winning the NBA title), and our condolences to Joey Dorsey, who as of 2015 is playing in the Turkish Basketball League.

Partial, complete, and no pooling

Partial pooling is often presented as a compromise between complete pooling (in the above example, combining data from all players and estimating one NBA-wide \(p\)) and no pooling (using each player’s observed proportion of free throws made). This can be formalized by considering what happens to the parameter-level model in these three cases. With no pooling, the among-group (e.g., player) variance parameter approaches infinity, such that \(logit(p_i) \sim N(\mu_p, \sigma_p)\) with \(\sigma_p \rightarrow \infty\). With complete pooling, the among-group variance parameter is zero, so that all groups receive the group-level mean. With partial pooling, the estimation of \(\sigma_p\) leads to a data-informed amount of shrinkage.
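The three cases are easy to see side by side. A minimal sketch on simulated free-throw data (simulated here so the example is self-contained; the chapter’s real data frame d has the same structure):

```r
library(lme4)

# simulate free-throw records with wildly varying numbers of attempts
set.seed(1)
n_players <- 50
true_logit <- rnorm(n_players, mean = 1, sd = 0.5)    # player-level logit(p)
attempts <- sample(2:300, n_players, replace = TRUE)  # varying k_i
made <- rbinom(n_players, attempts, plogis(true_logit))
sim <- data.frame(Player = factor(1:n_players),
                  ft_made = made, ft_miss = attempts - made)

# complete pooling: one league-wide p for everybody
m_complete <- glm(cbind(ft_made, ft_miss) ~ 1, family = binomial, data = sim)

# no pooling: an independent p for each player
m_none <- glm(cbind(ft_made, ft_miss) ~ 0 + Player, family = binomial, data = sim)

# partial pooling: player intercepts share a normal distribution
m_partial <- glmer(cbind(ft_made, ft_miss) ~ (1 | Player),
                   family = binomial, data = sim)

p_pool    <- plogis(coef(m_complete))  # a single estimate
p_none    <- plogis(coef(m_none))      # one unshrunken estimate per player
p_partial <- fitted(m_partial)         # one shrunken estimate per player
```

On data like these, the partially pooled estimates are less spread out than the no-pooling estimates: the extreme values from players with few attempts are pulled toward the league-wide mean.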

Multiple levels, nestedness, and crossedness

In the previous example we had two levels: within and among players. In hierarchical modeling, it is often beneficial to include information about the units at each level. For instance, if the data were broken down shot-by-shot, then within a player, it might be relevant to include the number of minutes played as a covariate, with the expectation that players who have been in a game longer might be fatigued and shoot with less accuracy. At the among-player level, we know that players played on different teams. Perhaps there are systematic differences among teams in free-throw shooting ability, for instance because they have a coaching staff that consistently emphasizes free-throw shooting, or scouts that recruit players who are particularly good at shooting free throws. We can expand the previous model to include team effects as follows:

\[y_i \sim Binom(p_i, k_i)\]

\[logit(p_i) = p_0 + \pi_i + \tau_{j[i]}\]

\[\pi_i \sim N(0, \sigma_\pi)\]

\[\tau_j \sim N(0, \sigma_\tau)\]

so that the likelihood to maximize is:

\[[y_i \mid p_i] [p_i \mid p_0, \pi_i, \tau_{j[i]}] [\pi_i \mid \sigma_\pi] [\tau_j \mid \sigma_\tau]\]

Here, \(p_0\) is an intercept parameter that represents the mean logit probability of making a free throw across all players and teams. Player effects are represented by \(\pi_i\), and team effects are represented by \(\tau_{j[i]}\), with subscript indexing to represent that player \(i\) belongs to team \(j\).

Note that not all players play for all teams; that is, players are not “crossed” with teams. Most players, specifically those that played for only one team, are nested within teams. However, because some players switched teams partway through the season, these players show up on multiple teams. There is often a lot of confusion around nestedness in these types of models. We point out here that nestedness is a feature of the data, not a modeling decision. There are cases when the data are nested but structured poorly, so that extra work is required to adequately represent the nestedness of the data. For instance, if we had data with measurements from 5 regions, each with 3 sites, and the sites were always labeled 1, 2, or 3, then our data might look like this:

##    region site
## 1       1    1
## 2       1    2
## 3       1    3
## 4       2    1
## 5       2    2
## 6       2    3
## 7       3    1
## 8       3    2
## 9       3    3
## 10      4    1
## 11      4    2
## 12      4    3
## 13      5    1
## 14      5    2
## 15      5    3

But, this is misleading. The observations labeled site 1 are actually 5 different sites, occurring in 5 different regions. A more accurate data structure would be:

##    region site
## 1       1    1
## 2       1    2
## 3       1    3
## 4       2    4
## 5       2    5
## 6       2    6
## 7       3    7
## 8       3    8
## 9       3    9
## 10      4   10
## 11      4   11
## 12      4   12
## 13      5   13
## 14      5   14
## 15      5   15

Fitting a 3-level model with lme4

Returning to the example with player and team effects, where both are drawn from prior distributions with hyperparameters \(\sigma_\pi\) and \(\sigma_\tau\) to be estimated, we first need to re-aggregate the data so that team information is retained:

d <- rawd %>%
  group_by(Player, Tm, Age, Pos) %>%
  summarize(ft_made = sum(FT), 
            ft_miss = sum(FTA) - sum(FT),
            ft_shot = sum(FTA), 
            ft_pct = sum(FT) / sum(FTA)) %>%
  subset(ft_shot != 0) %>%
  arrange(-ft_pct) %>%
  droplevels()
d
## Source: local data frame [626 x 8]
## Groups: Player, Tm, Age [626]
## 
##           Player     Tm   Age    Pos ft_made ft_miss ft_shot    ft_pct
##           (fctr) (fctr) (int) (fctr)   (int)   (int)   (int)     (dbl)
## 1   Aaron Brooks    CHI    30     PG     145      29     174 0.8333333
## 2   Aaron Gordon    ORL    19     PF      44      17      61 0.7213115
## 3  Adreian Payne    ATL    23     PF       1       1       2 0.5000000
## 4  Adreian Payne    MIN    23     PF      29      15      44 0.6590909
## 5  Adreian Payne    TOT    23     PF      30      16      46 0.6521739
## 6     A.J. Price    CLE    28     PG       4       2       6 0.6666667
## 7     A.J. Price    IND    28     PG      12       6      18 0.6666667
## 8     A.J. Price    TOT    28     PG      16       8      24 0.6666667
## 9  Alan Anderson    BRK    32     SG      82      19     101 0.8118812
## 10    Alec Burks    UTA    23     SG     106      23     129 0.8217054
## ..           ...    ...   ...    ...     ...     ...     ...       ...
m <- glmer(cbind(ft_made, ft_miss) ~ (1|Player) + (1|Tm), 
         family=binomial, data=d)
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ft_made, ft_miss) ~ (1 | Player) + (1 | Tm)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##   3899.1   3912.5  -1946.6   3893.1      623 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.17015 -0.30554  0.01832  0.30808  1.91060 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Player (Intercept) 0.2951   0.5432  
##  Tm     (Intercept) 0.0000   0.0000  
## Number of obs: 626, groups:  Player, 475; Tm, 31
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.07548    0.02891    37.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pr <- profile(m)
confint(pr)
##                 2.5 %     97.5 %
## .sig01      0.5002122 0.59097299
## .sig02      0.0000000 0.06070247
## (Intercept) 1.0187033 1.13228412

So, it appears that the MLE for the variance attributable to teams is zero. Profiling the likelihood, we see that there may be a little bit of variance attributable to teams, but we may be better off discarding the team information and instead including information about players’ positions. The logic here is that we expect players to vary in their shooting percentages based on whether they are guards, forwards, centers, etc.

m2 <- glmer(cbind(ft_made, ft_miss) ~ (1|Player) + (1|Pos), 
            family=binomial, data=d)
summary(m2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ft_made, ft_miss) ~ (1 | Player) + (1 | Pos)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##   3851.9   3865.2  -1923.0   3845.9      623 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.16535 -0.32463  0.02557  0.32095  1.87141 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Player (Intercept) 0.23943  0.4893  
##  Pos    (Intercept) 0.03869  0.1967  
## Number of obs: 626, groups:  Player, 475; Pos, 11
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.09854    0.08041   13.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m, m2)
##    df      AIC
## m   3 3899.147
## m2  3 3851.925

So, m2 seems to be better in terms of AIC, and one could argue that m2 will also be more useful for predictive applications. For instance, if we wanted to predict the free throw shooting ability of a new player, we could get a more precise estimate with m2 because we could use information about their position (if it were known). In contrast, model m would predict the same \(p\) regardless of position, because position is not in the model.
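A sketch of what such a position-aware prediction looks like, using simulated data so the example is self-contained (the fitted model has the same structure as m2; player names and position effects here are invented):

```r
library(lme4)

# simulate 60 players in 3 positions, 100 attempts each
set.seed(2)
sim <- data.frame(
  Player = factor(1:60),
  Pos    = factor(rep(c("PG", "SG", "C"), each = 20)),
  shots  = rep(100, 60)
)
pos_effect <- c(PG = 0.4, SG = 0.2, C = -0.4)  # invented position effects
sim$made <- rbinom(60, sim$shots,
                   plogis(1 + pos_effect[as.character(sim$Pos)] +
                            rnorm(60, 0, 0.3)))

fit <- glmer(cbind(made, shots - made) ~ (1 | Player) + (1 | Pos),
             family = binomial, data = sim)

# an unseen player: with allow.new.levels = TRUE, his player effect is
# set to zero, so the prediction combines the intercept with the
# estimated effect of his (known) position
new_player <- data.frame(Player = "Newcomer", Pos = "C")
p_hat <- predict(fit, newdata = new_player, type = "response",
                 allow.new.levels = TRUE)
```

Without position in the model, every unseen player would get the same prediction: plogis of the intercept alone.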

Level-specific covariates

In hierarchical models, covariates can be included at specific levels. For instance, at the player level, we might expect that age has an impact on free throw shooting ability. Possibly, players peak at some age and then decline as they get older. We can represent this with a second-degree polynomial effect of age. Trying this:

m3 <- glmer(cbind(ft_made, ft_miss) ~ Age + I(Age^2) + (1|Player) + (1|Pos), 
            family=binomial, data=d)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 0.0280989 (tol =
## 0.001, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
summary(m3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ft_made, ft_miss) ~ Age + I(Age^2) + (1 | Player) + (1 |  
##     Pos)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##   3838.1   3860.3  -1914.0   3828.1      621 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.99299 -0.33689  0.02006  0.33940  1.88029 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Player (Intercept) 0.22661  0.4760  
##  Pos    (Intercept) 0.04262  0.2065  
## Number of obs: 626, groups:  Player, 475; Pos, 11
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.956747   0.952742  -2.054  0.03999 * 
## Age          0.206198   0.069517   2.966  0.00302 **
## I(Age^2)    -0.003347   0.001252  -2.674  0.00749 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) Age   
## Age      -0.993       
## I(Age^2)  0.981 -0.996
## convergence code: 0
## Model failed to converge with max|grad| = 0.0280989 (tol = 0.001, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?

We get a warning about convergence failure resulting from numerical issues, with the recommendation to rescale our variables. Recall that unscaled continuous covariates tend to cause correlations between the estimates of slopes and intercepts, which is apparent in the Correlation of Fixed Effects section. Rescaling age does the trick in this example:

d$age <- (d$Age - mean(d$Age)) / sd(d$Age)
m3 <- glmer(cbind(ft_made, ft_miss) ~ age + I(age^2) + (1|Player) + (1|Pos), 
            family=binomial, data=d)
summary(m3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ft_made, ft_miss) ~ age + I(age^2) + (1 | Player) + (1 |  
##     Pos)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##   3838.1   3860.3  -1914.0   3828.1      621 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.99300 -0.33690  0.02006  0.33940  1.88029 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Player (Intercept) 0.22661  0.4760  
##  Pos    (Intercept) 0.04263  0.2065  
## Number of obs: 626, groups:  Player, 475; Pos, 11
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.15916    0.08597  13.484  < 2e-16 ***
## age          0.11720    0.02843   4.122 3.76e-05 ***
## I(age^2)    -0.05722    0.02143  -2.671  0.00757 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) age   
## age       0.109       
## I(age^2) -0.251 -0.396
AIC(m, m2, m3)
##    df      AIC
## m   3 3899.147
## m2  3 3851.925
## m3  5 3838.078

This model does seem to receive more support than the other two models. We can visualize the age result as follows:

lo <- 100
new_age <- seq(min(d$age), max(d$age), length.out=lo)
# design matrix for the fixed effects: intercept, age, age^2
X <- matrix(c(rep(1, lo), new_age, new_age^2), nrow=lo)
logit_p <- X %*% fixef(m3)
# back-transform age to its original scale for plotting
orig_age <- new_age * sd(d$Age) + mean(d$Age) 
plot(orig_age, plogis(logit_p), type='l', 
     xlab="Age", ylab="p")

There are also packages designed to aid in visualizing the output of lmer and glmer model objects. Handy plots here include a sorted caterpillar plot of the random effects:

library(sjPlot)
library(arm)
sjp.glmer(m3, ri.nr = 1, sort = "(Intercept)")

sjp.glmer(m3, ri.nr=2, sort = "(Intercept)")
## Plotting random effects...

We assumed that the random effects are normally distributed on a logit scale. This assumption can be checked with a q-q plot:

sjp.glmer(m3, type = "re.qq", facet.grid=F)
## Testing for normal distribution. Dots should be plotted along the line.

Last, we should do a sanity check for our estimated probabilities. One approach is to reproduce the shrinkage plots that we constructed for our first varying intercept model.

d$estimated_p <- fitted(m3)
d$diff <- d$ft_pct - d$estimated_p
ggplot(d, aes(x=ft_shot, y=diff)) + 
  geom_point(shape=1) + 
  xlab("Free throw attempts") +
  ylab("Naive MLE - Shrunken MLE")

ggplot(d) + 
  geom_segment(aes(x="Naive MLE", xend="Shrunken MLE", 
                   y=ft_pct, yend=estimated_p, group=Player), 
               alpha=.3) + 
  xlab('Estimate') + 
  ylab('Pr(make freethrow)')

ggplot(d, aes(x=ft_shot, y=estimated_p)) + 
  geom_point(shape=1) + 
  xlab("Free throw attempts") +
  ylab("Shrunken MLE")

Who does this model identify as the worst?

d[which.min(d$estimated_p), 
  c("Player", "Age", "Pos", "ft_shot", "ft_pct", "estimated_p")]
## Source: local data frame [1 x 6]
## 
##        Player   Age    Pos ft_shot    ft_pct estimated_p
##        (fctr) (int) (fctr)   (int)     (dbl)       (dbl)
## 1 Ian Mahinmi    28      C     102 0.3039216   0.3652981

Which player might be best?

d[which.max(d$estimated_p), 
  c("Player", "Age", "Pos", "ft_shot","ft_pct", "estimated_p")]
## Source: local data frame [1 x 6]
## 
##          Player   Age    Pos ft_shot    ft_pct estimated_p
##          (fctr) (int) (fctr)   (int)     (dbl)       (dbl)
## 1 Stephen Curry    26     PG     337 0.9139466   0.9023958

Further reading

Gelman and Hill. 2009. Data analysis using regression and multilevel/hierarchical models. Chapter 11, 12.

Gelman et al. 2014. Bayesian data analysis, 3rd edition. Chapter 5.

Efron, Bradley, and Carl N. Morris. Stein’s paradox in statistics. WH Freeman, 1977.

Gelman, Andrew, Jennifer Hill, and Masanao Yajima. “Why we (usually) don’t have to worry about multiple comparisons.” Journal of Research on Educational Effectiveness 5.2 (2012): 189-211.